Abstract

Recent approaches to distributed model fitting rely heavily on consensus ADMM, where each node solves small sub-problems using only local data. We propose iterative methods that solve global sub-problems over an entire distributed dataset. This is possible using transpose reduction strategies that allow a single node to solve least-squares over massive datasets without putting all the data in one place. This results in simple iterative methods that avoid the expensive inner loops required for consensus methods. We analyze the convergence rates of the proposed schemes and demonstrate the efficiency of this approach by fitting linear classifiers and sparse linear models to large datasets using a distributed implementation with up to 20,000 cores in far less time than previous approaches.

1  Introduction 

We  study  optimization  routines  for  problems  of  the  form 

minimize  f(Dx),  (1) 

where $D \in \mathbb{R}^{m \times n}$ is a (large) data matrix and $f$ is a convex function. We are particularly interested in the case that D is stored in a distributed way across N nodes of a network or cluster. In this case, the matrix $D = (D_1^T, D_2^T, \dots, D_N^T)^T$ is a vertical stack of sub-matrices, each of which is stored on a node. If the function $f$ decomposes across nodes as well, then problem (1) takes the form

minimize $\sum_{i \le N} f_i(D_i x)$,  (2)

where the summation is over the N nodes. Problems of this form include logistic regression, support vector machines, lasso, and virtually all generalized linear models [?].

Most distributed solvers for equation (2), such as ADMM, assume that one cannot solve global optimization problems involving the entire matrix D. Rather, each node alternates between solving small sub-problems involving only local data, and exchanging information with other nodes.

This work considers methods that solve global optimization problems over the entire distributed dataset on each iteration using transpose reduction methods. Such schemes exploit the following simple observation: when D has many more rows than columns, the matrix $D^T D$ is considerably smaller than D. The availability of $D^T D$ enables a single node to solve least-squares problems involving the entire data matrix D. Furthermore, in many applications it is possible and efficient to compute $D^T D$ in a distributed way. This simple approach solves extremely large optimization problems much faster than the current state of the art. We support this conclusion with convergence bounds and experiments involving multi-terabyte datasets.

2  Background 

Much recent work on solvers for formulation (2) has focused on the Alternating Direction Method of Multipliers (ADMM) [?], which has become a staple of the distributed computing and image processing literature. The authors of [?] propose using ADMM for distributed model fitting using the “consensus” formulation. Consensus ADMM has additionally been studied for distributed model fitting [?], support vector machines [?], and numerous domain-specific applications [?]. Many variations of ADMM have subsequently been proposed, including specialized variants for decentralized systems [?], asynchronous updates [?], inexact solutions to subproblems [?], and online/stochastic updates [12].

ADMM is a general method for solving the problem

minimize $g(x) + h(y)$, subject to $Ax + By = 0$.  (3)

ADMM enables each term of problem (3) to be addressed separately. The algorithm in its simplest form begins with estimated solutions $x^0$, $y^0$, and a Lagrange multiplier $\lambda^0$. The “scaled” ADMM then generates the iterates

$x^{k+1} = \arg\min_x \; g(x) + \frac{\tau}{2}\|Ax + By^k + \lambda^k\|^2$
$y^{k+1} = \arg\min_y \; h(y) + \frac{\tau}{2}\|Ax^{k+1} + By + \lambda^k\|^2$
$\lambda^{k+1} = \lambda^k + Ax^{k+1} + By^{k+1}$  (4)

where $\tau$ is any positive stepsize parameter. Disparate formulations are achieved by different choices of A, B, g, and h. For example, consensus ADMM [?] addresses the problem

minimize $\sum_i f_i(x_i)$, subject to $x_i = y$ for all $i$,  (5)

which corresponds to (3) with $A = I$, $B = -(I, I, \dots, I)^T$, $g(x) = \sum_i f_i(x_i)$, and $h = 0$. Rather than solving a global problem, consensus ADMM performs the parallel updates

$x_i^{k+1} = \arg\min_{x_i} \; f_i(x_i) + \frac{\tau}{2}\|x_i - y^k + \lambda_i^k\|^2$.  (6)

The shared variable y is updated by the central server, and the Lagrange multipliers $\{\lambda_i\}$ force the $\{x_i\}$ to be progressively more similar on each iteration.

3  Transpose  Reduction  Made  Easy: 
Regularized  Least-Squares 

Transpose reduction is most easily understood for regularized least-squares problems; we discuss the general case in Section 4. Consider the problem

minimize $J(x) + \frac{1}{2}\|Dx - b\|^2$  (7)

for some penalty term J. When $J(x) = \mu|x|$ for some scalar $\mu$, this becomes the lasso regression [?]. Typical consensus solvers for problem (7) require each node to compute the solution to equation (6), which is here given by

$x_i^{k+1} = \arg\min_{x_i} \; \frac{1}{2}\|D_i x_i - b_i\|^2 + \frac{\tau}{2}\|x_i - y^k + \lambda_i^k\|^2 = (D_i^T D_i + \tau I)^{-1}\bigl(D_i^T b_i + \tau (y^k - \lambda_i^k)\bigr)$.

During the setup phase for consensus ADMM, each node forms the matrix $D_i^T D_i$, and then computes and caches the inverse (or equivalently the factorization) of $(D_i^T D_i + \tau I)$.
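The consensus update with a cached inverse can be sketched in a few lines. This is a minimal single-process simulation, assuming J = 0 (plain least squares) so that the central y update reduces to an average; the function name, the problem sizes, and the choice tau = 50 are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def consensus_admm_least_squares(D_blocks, b_blocks, tau, iters=200):
    """Consensus ADMM for min_x sum_i 0.5*||D_i x - b_i||^2 (assumes J = 0)."""
    n = D_blocks[0].shape[1]
    # Setup phase: each node caches the inverse of (D_i^T D_i + tau*I).
    inverses = [np.linalg.inv(D.T @ D + tau * np.eye(n)) for D in D_blocks]
    rhs_const = [D.T @ b for D, b in zip(D_blocks, b_blocks)]
    y = np.zeros(n)
    lam = [np.zeros(n) for _ in D_blocks]
    for _ in range(iters):
        # Local x_i updates (these would run in parallel, one per node).
        xs = [inv @ (c + tau * (y - l))
              for inv, c, l in zip(inverses, rhs_const, lam)]
        # Central server: with J = 0 the y update is a plain average.
        y = np.mean([x + l for x, l in zip(xs, lam)], axis=0)
        # Scaled multiplier updates.
        lam = [l + x - y for l, x in zip(lam, xs)]
    return y

rng = np.random.default_rng(0)
D_blocks = [rng.standard_normal((50, 5)) for _ in range(4)]
b_blocks = [rng.standard_normal(50) for _ in range(4)]
x_consensus = consensus_admm_least_squares(D_blocks, b_blocks, tau=50.0)
x_direct = np.linalg.lstsq(np.vstack(D_blocks), np.concatenate(b_blocks), rcond=None)[0]
```

Note that every node solves its own n-by-n linear system on every iteration, and agreement between nodes is only reached in the limit.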

Alternatively, transpose reduction can solve (7) on a single machine without moving the entire matrix D to one place, greatly reducing the required amount of computation. Using the simple identity

$\frac{1}{2}\|Dx - b\|^2 = \frac{1}{2}(Dx - b)^T(Dx - b) = \frac{1}{2}x^T(D^T D)x - x^T D^T b + \frac{1}{2}\|b\|^2$

we can replace problem (7) with the equivalent problem

minimize $J(x) + \frac{1}{2}x^T(D^T D)x - x^T D^T b$.  (8)

To solve problem (8), the central server needs only the matrix $D^T D$ and the vector $D^T b$. When D is a “tall” matrix $D \in \mathbb{R}^{m \times n}$ with $n \ll m$, $D^T D$ has only $n^2$ (rather than $nm$) entries, small enough to store on a single server. Furthermore, because $D^T D = \sum_i D_i^T D_i$ and $D^T b = \sum_i D_i^T b_i$, the matrix $D^T D$ and the vector $D^T b$ are formed by having each server do computations on its own local $D_i$, and then reducing the results on a central server.

Once $D^T D$ and $D^T b$ have been computed in the cloud and cached on a central server, the global problem can be solved on this server. This is done using either a single-node ADMM method for small dense lasso (see [?], Section 6.4) or a forward-backward (proximal) splitting method [?]. The latter approach only requires the gradient of $\frac{1}{2}x^T(D^T D)x - x^T D^T b$, which is given by $D^T D x - D^T b$.
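The reduction and the single-node solve can be illustrated as follows. This is a minimal sketch, assuming the lasso penalty J(x) = mu*|x| and a plain (unaccelerated) forward-backward iteration; the helper names, problem sizes, and the value mu = 5 are hypothetical, and the loop over blocks stands in for work that each node would do locally.

```python
import numpy as np

def local_gram(D_i, b_i):
    """Work done on node i: form D_i^T D_i and D_i^T b_i from local rows only."""
    return D_i.T @ D_i, D_i.T @ b_i

def soft_threshold(v, t):
    """Proximal operator of t*||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_from_gram(G, c, mu, iters=2000):
    """Forward-backward splitting on mu*|x| + 0.5*x^T G x - x^T c."""
    step = 1.0 / np.linalg.eigvalsh(G)[-1]  # 1 / Lipschitz constant of the gradient
    x = np.zeros(G.shape[0])
    for _ in range(iters):
        grad = G @ x - c  # gradient of the smooth part: D^T D x - D^T b
        x = soft_threshold(x - step * grad, step * mu)
    return x

rng = np.random.default_rng(0)
D_blocks = [rng.standard_normal((200, 10)) for _ in range(5)]
x_true = np.zeros(10)
x_true[:3] = 1.0
b_blocks = [D @ x_true + 0.1 * rng.standard_normal(200) for D in D_blocks]

# Reduction step: the central server only ever receives n-by-n matrices and n-vectors.
grams = [local_gram(D, b) for D, b in zip(D_blocks, b_blocks)]
G = sum(g for g, _ in grams)
c = sum(v for _, v in grams)
mu = 5.0
x_hat = lasso_from_gram(G, c, mu)
```

After the reduction, the solver never touches the 1000-by-10 data matrix again; each iteration costs one 10-by-10 matrix-vector product.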

4  Unwrapping  ADMM:  Transpose 
Reduction  for  General  Problems 

Transpose reduction can be applied to complex problems using ADMM. On each iteration of the proposed method, least-squares problems are solved over the entire distributed dataset. In contrast, consensus methods use subproblems involving only small data subsets.

We aim to solve problem (1) by adapting a common ADMM formulation from the imaging literature [?]. We begin by “unwrapping” the matrix D; we remove it from f using the formulation

minimize $f(y)$, subject to $y = Dx$.  (9)

Applying ADMM with $A = D$, $B = -I$, $h = f$, and $g = 0$ yields the following iterates:

$x^{k+1} = \arg\min_x \|Dx - y^k + \lambda^k\|^2 = D^+(y^k - \lambda^k)$
$y^{k+1} = \arg\min_y \; f(y) + \frac{\tau}{2}\|Dx^{k+1} - y + \lambda^k\|^2$
$\lambda^{k+1} = \lambda^k + Dx^{k+1} - y^{k+1}$.  (10)

The x update solves a global least-squares problem over the entire dataset, which requires the pseudoinverse of D. The y update can be written $y^{k+1} = \mathrm{prox}_f(Dx^{k+1} + \lambda^k, \tau^{-1})$, where we used the proximal mapping of f, which is defined as $\mathrm{prox}_f(z, \delta) = \arg\min_y \; f(y) + \frac{1}{2\delta}\|y - z\|^2$. Provided f is decomposable, the minimization in this update is coordinate-wise decoupled. Each coordinate of $y^{k+1}$ is computed with either an analytical solution, or using a simple 1-dimensional lookup table of solutions.
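When no closed form is available, each coordinate of the proximal mapping is a smooth one-dimensional problem. The sketch below evaluates the proximal mapping of the logistic loss with a fixed number of undamped scalar Newton steps, vectorized over coordinates. It is offered as an alternative illustration to the lookup table mentioned above, not as the authors' implementation; it assumes labels in {-1, +1} and a moderate delta (a safeguarded solver would be advisable for large delta).

```python
import numpy as np

def prox_logistic(z, labels, delta, newton_steps=50):
    """Coordinate-wise prox of f(y) = sum_k log(1 + exp(-l_k * y_k)).

    For every k independently, solves
        min_y  log(1 + exp(-l_k * y)) + (1 / (2 * delta)) * (y - z_k)**2
    using scalar Newton steps started from y = z.
    """
    y = z.copy()
    for _ in range(newton_steps):
        # s = 1 / (1 + exp(l * y)), written with tanh to avoid overflow.
        s = 0.5 * (1.0 - np.tanh(0.5 * labels * y))
        grad = -labels * s + (y - z) / delta
        hess = s * (1.0 - s) + 1.0 / delta  # labels**2 == 1 for +/-1 labels
        y = y - grad / hess
    return y

rng = np.random.default_rng(1)
z = 3.0 * rng.standard_normal(20)
labels = rng.choice([-1.0, 1.0], size=20)
delta = 0.5
y = prox_logistic(z, labels, delta)
```

Because the coordinates are decoupled, the cost is linear in the length of z and no communication between nodes is needed.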
4.1  Distributed  Implementation

While Transpose ADMM is highly effective on a single machine, there are additional benefits in large, distributed datasets. The quantities $D = (D_1^T, D_2^T, \dots, D_N^T)^T$, $y = (y_1^T, y_2^T, \dots, y_N^T)^T$, and $\lambda = (\lambda_1^T, \lambda_2^T, \dots, \lambda_N^T)^T$ can all be distributed over N nodes, such that no node has sufficient access to solve the global least-squares problem for $x^{k+1}$. This is where we exploit transpose reduction. The constraint in (9) now becomes $y_i = D_i x$, and the least-squares x update in (10) becomes

$x^{k+1} = D^+(y^k - \lambda^k) = \bigl(\sum_i D_i^T D_i\bigr)^{-1} \sum_i D_i^T (y_i^k - \lambda_i^k)$.  (11)

Each vector $D_i^T(y_i^k - \lambda_i^k)$ can be computed locally on node i. Multiplication by $(\sum_i D_i^T D_i)^{-1}$ (which need only be computed once) takes place on the central server. The distributed method is listed in Algorithm 1. Note the massive dimensionality reduction that takes place when $D^T D = \sum_i D_i^T D_i$ is formed.


Algorithm 1  Transpose Reduction ADMM
1: Central node: Initialize global $x^0$ and $\tau$
2: All nodes: Initialize local $\{y_i^0\}$, $\{\lambda_i^0\}$
3: All nodes: $W_i = D_i^T D_i$, $\forall i$
4: Central node: $W = (\sum_i W_i)^{-1}$
5: while not converged do
6:   All nodes: $d_i^k = D_i^T (y_i^k - \lambda_i^k)$, $\forall i$
7:   Central node: $x^{k+1} = W \sum_i d_i^k$
8:   All nodes: $y_i^{k+1} = \arg\min_{y_i} f_i(y_i) + \frac{\tau}{2}\|D_i x^{k+1} - y_i + \lambda_i^k\|^2 = \mathrm{prox}_{f_i}(D_i x^{k+1} + \lambda_i^k, \tau^{-1})$, $\forall i$
9:   All nodes: $\lambda_i^{k+1} = \lambda_i^k + D_i x^{k+1} - y_i^{k+1}$
10: end while
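The steps of Algorithm 1 can be simulated in a single process as follows. This is a hedged sketch rather than the distributed MPI implementation described later: the Python lists stand in for per-node storage, a fixed iteration count replaces the convergence test, and the function name and the illustrative loss $f_i(y) = \frac{1}{2}\|y - b_i\|^2$ (chosen because its solution can be checked against a direct least-squares solve) are assumptions.

```python
import numpy as np

def transpose_reduction_admm(D_blocks, prox_blocks, tau, iters=200):
    """Single-process simulation of Algorithm 1.

    prox_blocks[i](z, delta) must return prox_{f_i}(z, delta) for node i.
    """
    n = D_blocks[0].shape[1]
    y = [np.zeros(D.shape[0]) for D in D_blocks]       # step 2
    lam = [np.zeros(D.shape[0]) for D in D_blocks]
    W = np.linalg.inv(sum(D.T @ D for D in D_blocks))  # steps 3-4, done once
    x = np.zeros(n)
    for _ in range(iters):
        # Step 6: each node forms an n-vector from its local data.
        d = [D.T @ (yi - li) for D, yi, li in zip(D_blocks, y, lam)]
        # Step 7: the central node solves the global least-squares problem.
        x = W @ sum(d)
        # Step 8: local proximal updates.
        y = [prox(D @ x + li, 1.0 / tau)
             for prox, D, li in zip(prox_blocks, D_blocks, lam)]
        # Step 9: local multiplier updates.
        lam = [li + D @ x - yi for li, D, yi in zip(lam, D_blocks, y)]
    return x

rng = np.random.default_rng(0)
D_blocks = [rng.standard_normal((40, 6)) for _ in range(3)]
b_blocks = [rng.standard_normal(40) for _ in range(3)]
# Illustrative loss f_i(y) = 0.5*||y - b_i||^2, whose prox is (z + delta*b_i) / (1 + delta).
prox_blocks = [lambda z, delta, b=b: (z + delta * b) / (1.0 + delta) for b in b_blocks]
x_admm = transpose_reduction_admm(D_blocks, prox_blocks, tau=2.0)
x_direct = np.linalg.lstsq(np.vstack(D_blocks), np.concatenate(b_blocks), rcond=None)[0]
```

Swapping in a different `prox_blocks` list (for example a hinge or logistic proximal mapping) changes the model being fit without changing the communication pattern.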


4.2  Heterogeneous problems

Transpose reduction ADMM is, in a sense, the opposite of consensus. Transpose reduction methods solve a global data-dependent problem on the central node, while the remote nodes only perform proximal operators and matrix multiplications. In contrast, consensus methods solve all data-dependent problems in the remote nodes, and no data is ever seen by the central node. This important property makes transpose reduction extremely powerful when data is heterogeneous across nodes, as opposed to homogeneous problems where each node’s data is drawn from identical distributions. With standard homogeneous Gaussian test problems, all consensus nodes solve nearly identical problems, and thus arrive at a consensus quickly.

In practical applications, data on different nodes often represents data from different sources and is thus not identically distributed. The efficiency of consensus in such (more realistic) situations significantly decreases. Because transpose reduction solves global problems over the entire dataset, the distribution of data over nodes is irrelevant, making these methods insensitive to data heterogeneity. We discuss theoretical reasons for this in Section 6.1 and explore the impact of heterogeneity with synthetic and real data in Section 8.

4.3  Splitting  Over  Columns

When the matrix D is extremely wide ($m \ll n$), it often happens that each server stores a subset of columns of D rather than rows. Fortunately, such problems can be handled by solving the dual of the original problem. The dual of the sparse problem (14) is given by

$\underset{\alpha}{\text{minimize}} \; f^*(\alpha)$ subject to $\|D^T \alpha\|_\infty \le \mu$  (12)

where $f^*$ denotes the Fenchel conjugate [?] of f. For example, the dual of the lasso problem is simply

$\underset{\alpha}{\text{minimize}} \; \frac{1}{2}\|\alpha + b\|^2$ subject to $\|D^T \alpha\|_\infty \le \mu$.

Problem (12) then reduces to the form (1) with

$\hat{D} = \begin{pmatrix} I \\ D^T \end{pmatrix}$,  $\hat{f}(z)_k = \begin{cases} \frac{1}{2}\|z_k + b_k\|^2, & 1 \le k \le m \\ \mathcal{X}(z_k), & k > m \end{cases}$

where $\mathcal{X}(z)$ is the characteristic function of the $\ell_\infty$ ball of radius $\mu$. The function $\mathcal{X}(z)$ is infinite when $|z_i| > \mu$ for some i, and zero otherwise. The unwrapped ADMM for this problem requires the formation of $D_i D_i^T$ on each server, rather than $D_i^T D_i$.

5  Applications:  Linear  Classifiers  and  Sparsity

In addition to penalized regression problems, transpose reduction can train linear classifiers. If $D \in \mathbb{R}^{m \times n}$ contains feature vectors and $l$ contains the m binary labels, then a logistic classifier is put in the form (9) by letting f(z) be the logistic loss $f_{lr}(z) = \sum_{k=1}^{m} \log(1 + \exp(-l_k z_k))$.

This formulation also includes support vector machines, in which we solve

minimize $\frac{1}{2}\|x\|^2 + C h(Dx)$,  (13)

where C is a regularization parameter, and $h(z) = \sum_{k=1}^{m} \max\{1 - l_k z_k, 0\}$ is a simple “hinge loss” function. The y update in (10) becomes very easy because the proximal mapping of h has the closed form

$\mathrm{prox}_h(z, \delta)_k = z_k + l_k \max\{\min\{1 - l_k z_k, \delta\}, 0\}$.

Note that this algorithm is much simpler than the consensus implementation of SVM, which requires each node to solve SVM-like sub-problems using expensive iterative methods (see Section A in the supplementary material).
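The closed-form proximal mapping of the hinge loss is a one-liner, and it can be sanity-checked against a brute-force minimization of the scalar proximal objective. The sketch below assumes labels in {-1, +1}; the grid search exists only for verification and would not be used in practice, and the test values are arbitrary.

```python
import numpy as np

def prox_hinge(z, labels, delta):
    """Closed-form prox of h(y) = sum_k max(1 - l_k*y_k, 0), labels in {-1, +1}."""
    return z + labels * np.maximum(np.minimum(1.0 - labels * z, delta), 0.0)

def prox_hinge_bruteforce(z_k, l_k, delta, grid):
    """Minimize the scalar prox objective over a dense grid (for checking only)."""
    obj = np.maximum(1.0 - l_k * grid, 0.0) + (grid - z_k) ** 2 / (2.0 * delta)
    return grid[np.argmin(obj)]

grid = np.linspace(-5.0, 5.0, 200001)  # grid spacing 5e-5
z = np.array([-2.0, 0.8, 2.0, 0.9, -0.8, -3.0])
labels = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
closed = prox_hinge(z, labels, delta=0.5)
brute = np.array([prox_hinge_bruteforce(zk, lk, 0.5, grid)
                  for zk, lk in zip(z, labels)])
```

The three regimes of the formula are all exercised here: a full step of length delta toward the margin, a partial step that stops exactly at the margin, and no movement when the margin is already satisfied.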

Sparse model fitting problems have the form

minimize $\mu|x| + f(Dx)$  (14)

for some regularization parameter $\mu > 0$. Sparse problems can be reduced to the form (1) by defining

$\hat{D} = \begin{pmatrix} I \\ D \end{pmatrix}$,  $\hat{f}(z)_k = \begin{cases} \mu|z_k|, & 1 \le k \le n \\ f_k(z_k), & k > n \end{cases}$

and then minimizing $\hat{f}(\hat{D}x)$. Experimental results are presented in Section 8.
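Transpose reduction carries over to the stacked matrix with no extra communication, because its Gram matrix differs from that of D only by an identity. The short derivation below assumes that the stacked matrix places the n-by-n identity on top of D, which is the reading consistent with the first n coordinates of the stacked loss carrying the $\ell_1$ terms:

```latex
\hat{D} = \begin{pmatrix} I \\ D \end{pmatrix}
\quad\Longrightarrow\quad
\hat{D}^T \hat{D} = I + D^T D = I + \sum_{i} D_i^T D_i .
```

Each node therefore still contributes only its local $D_i^T D_i$, and the central server adds the identity once before inverting.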


6  Convergence  Theory 

Classical results prove convergence of ADMM but provide no rate [?]. More recently, rates have been obtained by using an unusual measure of convergence involving the change between adjacent iterates [?]. It is still an open problem to directly prove convergence of the iterates of ADMM in the general setting. In this section, we take a step in that direction by providing a rate at which the gradient of (1) goes to zero, thus directly showing that $x^k$ is a good approximate minimizer of (1). We do this by exploiting the form of transpose reduction ADMM and the analytical tools in [19].

Theorem 1. If the gradient of f exists and has Lipschitz constant $L(\nabla f)$, then (10) shrinks the gradient of the objective function (1) with global rate

$\|\nabla\{f(Dx^k)\}\|^2 = \|D^T \nabla f(Dx^k)\|^2 \le C\,\frac{\|y^0 - Dx^*\|^2 + \|\lambda^0 - \lambda^*\|^2}{k}$

where $C = (L(\nabla f) + \tau)^2 \rho(D^T D)$ is a constant and $\rho(D^T D)$ denotes the spectral radius of $D^T D$.

Proof. We begin by writing the optimality condition for the x-update in (10):

$D^T(Dx^{k+1} - y^k + \lambda^k) = D^T \lambda^{k+1} + D^T(y^{k+1} - y^k) = 0$.

Note we used the definition $\lambda^{k+1} = \lambda^k + Dx^{k+1} - y^{k+1}$ to simplify this condition. Similarly, the optimality condition for the y-update yields

$\nabla f(y^{k+1}) + \tau(y^{k+1} - Dx^{k+1} - \lambda^k) = \nabla f(y^{k+1}) - \tau\lambda^{k+1} = 0$,

or simply $\nabla f(y^{k+1}) = \tau\lambda^{k+1}$. Combining this with the x-update condition and shifting the index by one yields $D^T \nabla f(y^k) = \tau D^T(y^{k-1} - y^k)$. We now have

$\partial_x\{f(Dx^k)\} = D^T \nabla f(Dx^k) = D^T \nabla f(y^k + Dx^k - y^k)$
$= D^T \nabla f(y^k + Dx^k - y^k) - D^T \nabla f(y^k) + \tau D^T(y^{k-1} - y^k)$

and so $\|\partial_x\{f(Dx^k)\}\|$ is bounded above by

$\|D^T \nabla f(y^k + Dx^k - y^k) - D^T \nabla f(y^k)\| + \tau\|D^T(y^{k-1} - y^k)\|$
$\le L(\nabla f)\|D^T\|_{op}\|Dx^k - y^k\| + \tau\|D^T\|_{op}\|y^{k-1} - y^k\|$,

where $\|D^T\|_{op}$ denotes the operator norm of $D^T$.

We now invoke the following known bound governing the difference between iterates of (4):

$\|B(y^{k+1} - y^k)\|^2 + \|Ax^{k+1} + By^{k+1}\|^2 \le \frac{\|B(y^0 - y^*)\|^2 + \|\lambda^0 - \lambda^*\|^2}{k+1}$

(see [?], Assertion 2.10). When adapted to the problem form (9), we obtain

$\|y^{k+1} - y^k\|^2 + \|Dx^{k+1} - y^{k+1}\|^2 \le \frac{\|y^0 - Dx^*\|^2 + \|\lambda^0 - \lambda^*\|^2}{k+1}$.  (15)

It follows from (15) that both $\|Dx^k - y^k\|$ and $\|y^{k-1} - y^k\|$ are bounded above by $\sqrt{(\|y^0 - Dx^*\|^2 + \|\lambda^0 - \lambda^*\|^2)/k}$. Applying this bound to the preceding inequality yields

$\|\partial_x\{f(Dx^k)\}\| \le \sqrt{C(\|y^0 - Dx^*\|^2 + \|\lambda^0 - \lambda^*\|^2)/k}$.

We obtain the result by squaring this inequality and noting that $\|D^T\|_{op}^2 = \rho(D^T D)$.  □

Note that logistic regression problems satisfy the conditions of Theorem 1 with $L(\nabla f) = 1/4$. Also, better rates are possible using accelerated ADMM [20].
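The value 1/4 for the logistic loss follows from a one-line calculation on a single coordinate. Writing $\sigma(t) = 1/(1 + e^{-t})$ and assuming labels $l \in \{-1, +1\}$:

```latex
\frac{d}{dz}\log\bigl(1+e^{-lz}\bigr) = -\,l\,\sigma(-lz),
\qquad
\frac{d^2}{dz^2}\log\bigl(1+e^{-lz}\bigr) = l^2\,\sigma(lz)\bigl(1-\sigma(lz)\bigr) \le \tfrac{1}{4},
```

with equality at $z = 0$. Because the loss is a sum over coordinates, its Hessian is diagonal with entries bounded by 1/4, which gives the stated Lipschitz constant for the gradient.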

6.1  Linear  convergence  analysis  and  heterogeneous 
data 

We now examine conditions under which transpose reduction is guaranteed to have a better worst-case performance bound than consensus methods, especially when data is heterogeneous across nodes. We examine convergence rates for the case of strongly convex f, in which case the iterates of the ADMM (4) are known to converge R-linearly. If $x^*$ and $\lambda^*$ are the optimal primal and dual solutions, then

$\tau\|A(x^{k+1} - x^*)\|^2 + \tau^{-1}\|\lambda^{k+1} - \lambda^*\|^2 \le (1 + \delta)^{-1}\bigl(\tau\|A(x^k - x^*)\|^2 + \tau^{-1}\|\lambda^k - \lambda^*\|^2\bigr)$

for some $\delta > 0$ [?]. If we denote the condition number of A by $\kappa_A$ and the condition numbers of the functions f and g by $\kappa_f$ and $\kappa_g$, then $\delta$ can be written in terms of $\kappa_A$, $\kappa_f$, and $\kappa_g$.

Applying this result to consensus ADMM ($A = I$ and $g(x) = \sum_i f_i(D_i x_i)$) where all $f_i$ are identically conditioned, we get

*  _  minj  Xmin(Dj Df)* 

Vcon  —  1  i  • 

max^  A  max  {DjDtfn} 
On the other hand, transpose reduction removes the matrix D from the objective function and corresponds to (3) with $A = I$ and a function g that no longer involves D. We thus obtain $\delta_{tr} = 1/\sqrt{\kappa_f}$ for the transpose reduction ADMM. Note that $\delta_{tr} \ge \delta_{con}$, and so the worst-case performance of transpose reduction is (significantly) better than consensus. The worst-case linear convergence rate of transpose reduction does not deteriorate for poorly conditioned D because the data term has been moved from the objective into the constraints. If we compare $\delta_{con}$ to $\delta_{tr}$, we expect the convergence of consensus methods to suffer with increased numbers of nodes and a more poorly conditioned D.

7  Implementation  Details 

We compare transpose reduction methods to consensus ADMM using both synthetic and empirical data. We study the transpose reduction scheme for lasso (Section 3) in addition to the unwrapped ADMM (Algorithm 1) for logistic regression and SVM. We built a distributed implementation of both the transpose reduction and consensus optimization methods, and ran all experiments on a large Cray supercomputer hosted by the DOD Supercomputing Resource Center. This allowed us to study experiments ranging in size from very small to extremely large. All distributed methods were implemented using MPI. Stopping conditions for both methods were set using the residuals defined in [?] with $\epsilon_{rel} = 10^{-3}$ and $\epsilon_{abs} = 10^{-6}$.

Many steps were taken to achieve top performance of the consensus optimization routine. The authors of [?] suggest using a stepsize parameter $\tau = 1$; however, better performance is achieved by tuning this parameter. We tuned the stepsize parameter to achieve convergence in a minimal number of iterations on a problem instance with m = 10,000 data vectors and n = 100 features per vector, and then scaled the stepsize parameter up/down to be proportional to m. It was found that this scaling made the number of iterations nearly invariant to n and m. In the consensus implementation, the iterative solvers for each local logistic regression/SVM problem were warm-started using solutions from the previous iteration.

The logistic regression subproblems were solved using a limited memory BFGS method (with warm start to accelerate performance). The transpose reduced lasso method (Section 3) requires a sparse least-squares method to solve the entire lasso problem on a single node. This was accomplished using the forward-backward splitting implementation FASTA [?].

Note that the consensus solver for SVM requires the solution to sub-problems that cannot be solved by conventional SVM solvers (see (17) in the supplemental material), so we built a custom solver using the same coordinate descent techniques as the well-known solver LIBSVM [23]. By using warm starts and exploiting the structure of the consensus sub-problem, our custom consensus ADMM method solves the problem dramatically faster than standard solvers for problem (13). See Appendix A in the supplementary materials for details.

8  Numerical  Experiments 

To study transpose reduction in a wide range of settings, we applied consensus and transpose solvers to both synthetic and empirical datasets. We recorded both the total compute time and the wallclock time. Total compute time is the time computing cores spend performing calculations, excluding communication; wall time includes all calculation and communication.

8.1  Synthetic  Data 

We study ADMM using both standard homogeneous test problems and the (more realistic) case where data is heterogeneous across nodes.

Lasso problems We use the same synthetic problems used to study consensus ADMM in [?]. The data matrix D is a random Gaussian matrix. The true solution $x_{true}$ contains 10 active features with unit magnitude, and the remaining entries are zero. The $\ell_1$ penalty $\mu$ is chosen as suggested in [?], i.e., the penalty is 10% of the magnitude of the penalty for which the solution to (7) becomes zero. The observation vector is $b = Dx_{true} + \eta$, where $\eta$ is a standard Gaussian noise vector with $\sigma = 1$.

Classification problems We generated a random Gaussian matrix for each class. The first class consists of zero-mean Gaussian entries. The first 5 columns of the second matrix were random Gaussian with mean 1, and the remaining columns were mean zero. The classes generated by this process are not perfectly linearly separable. The $\ell_1$ penalty was set using the “10%” rule used in [?].

Heterogeneous Data With homogeneous Gaussian test problems, every node solves nearly identical problems, and we arrive at a consensus quickly. As we saw in the theoretical analysis of Section 6.1, consensus ADMM deteriorates substantially when data is heterogeneous across nodes. To simulate heterogeneity, we chose one random Gaussian scalar for each node, and added it to $D_i$.
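The synthetic lasso setup described above can be sketched as a small generator. Several details are assumptions made for illustration: the per-node scalar is added to every entry of $D_i$, the active features are placed at random positions with value +1, and the "10%" rule uses the standard fact that the lasso solution is zero once the penalty reaches $\|D^T b\|_\infty$. The function name and sizes are hypothetical.

```python
import numpy as np

def make_lasso_problem(num_nodes, rows_per_node, n, heterogeneous, rng):
    """Synthetic lasso data split row-wise over nodes (illustrative sketch)."""
    D_blocks = []
    for _ in range(num_nodes):
        D_i = rng.standard_normal((rows_per_node, n))
        if heterogeneous:
            # One random Gaussian scalar per node, added to every entry of D_i.
            D_i = D_i + rng.standard_normal()
        D_blocks.append(D_i)
    D = np.vstack(D_blocks)
    x_true = np.zeros(n)
    x_true[rng.choice(n, size=10, replace=False)] = 1.0  # 10 active unit features
    b = D @ x_true + rng.standard_normal(D.shape[0])      # noise with sigma = 1
    mu_max = np.max(np.abs(D.T @ b))  # smallest penalty giving the zero solution
    mu = 0.1 * mu_max                 # the "10%" rule
    return D_blocks, b, x_true, mu

rng = np.random.default_rng(0)
blocks, b, x_true, mu = make_lasso_problem(4, 100, 50, True, rng)
```

With `heterogeneous=True`, each block has a different mean, so the local Gram matrices differ substantially from node to node even though the global problem is well defined.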

We performed experiments by varying the numbers of cores used, feature vector length, and the number of data points per core. Three representative results are illustrated in Figure 1, while more complete tables appear in supplementary material (Appendix B). In addition, convergence curves for experiments on both homogeneous and heterogeneous data can be seen in Figures 2a and 2b.

8.2  Empirical  Case  Study:  Classifying  Guide  Stars 

We perform experiments using the Second Generation Guide Star Catalog (GSC-II) [24], an astronomical database containing spectral and geometric features for 950 million stars and other objects. The GSC-II also classifies each astronomical body as “star” or “not a star.” We train a sparse logistic classifier to discern this classification using only spectral and geometric features.

[Figure 1, four panels. Top horizontal axis: amount of data (TB or GB). Bottom horizontal axis: number of cores.]

(a) Logistic regression with homogeneous data. Experiments used 100,000 data points of 1,000 features each per core.

(b) Logistic regression with heterogeneous data. Experiments used 100,000 data points of 1,000 features each per core. Note that transpose ADMM required only 120 hours of compute time for 15 TB of data.

(c) SVM with homogeneous data. Experiments used 50,000 data points of 100 features each per computing core.

(d) Lasso with heterogeneous data. Experiments used 50,000 data points of 200 features each per computing core.

Figure 1: Selected results from Consensus ADMM (green) and Transpose ADMM (blue) on three different optimization problems of varying sizes. Every core stores an identically-sized subset of the data, so data corpus size and number of cores are related linearly. The top horizontal axis denotes the total data corpus size, while the bottom horizontal axis denotes the number of computing cores used.

A  data  matrix  was  compiled  by  selecting  all  spectral  and 
geometric  measurements  reported  in  the  catalog,  and  also 
“interaction  features”  made  of  all  second-order  products. 
After  the  addition  of  a  bias  feature,  the  resulting  matrix  has 
307  features  per  object,  and  occupies  1.8  TB  of  space. 

Experiments recorded the global objective as a function of wall time. As seen in the convergence curves (Figure 2c), transpose ADMM converged far more quickly than consensus. We also experimented with storing the data matrix across different numbers of nodes. These experiments illustrated that this variable had little effect on the relative performance of the two optimization strategies; transpose methods remained far more efficient regardless of number of cores. See Table 1.

Note the strong advantage of transpose reduction over consensus ADMM in Figure 2c. This confirms the observations of Sections 6.1 and 8.1, where it was observed that transpose reduction methods are particularly powerful for heterogeneous real-world data, as opposed to the identically distributed matrices used in conventional synthetic experiments.

9  Discussion 

Transpose reduction methods required substantially less computation time than consensus methods in all experiments. As seen in Figure 1, Figure 2, and the supplementary table (Appendix B), performance of both methods scales nearly linearly with the number of cores and amount of data used for classification problems. For lasso (Figure 1d), the runtime of transpose reduction appears to grow sub-linearly with the number of cores. This contrasts with consensus methods that have a strong dependence on this parameter. This is largely because the transpose reduction method solves all data-dependent problems on a single machine, whereas consensus optimization requires progressively more communication with larger numbers of cores (as predicted in Section 6.1).

[Figure 2, three panels.]

(a) Homogeneous data. Experiments used 7,200 computing cores with 50,000 data points of 2,000 features each.

(b) Heterogeneous data. Experiments used 7,200 computing cores with 50,000 data points of 2,000 features each.

(c) Empirical star data. Experiments used 2,500 cores to classify 1.8TB of data. Consensus did not terminate until 1160 sec.

Figure 2: Logistic regression objective function vs wallclock time for Consensus ADMM (green) and Transpose ADMM (blue).

Cores   T-Wall    T-Comp     C-Wall    C-Comp
2500    0:01:06   11:35:25   0:24:39   31d, 19:59:13
3000    0:00:49   12:10:33   0:21:43   32d, 2:44:11
3500    0:00:50   12:17:27   0:17:01   30d, 7:56:19
4000    0:00:45   12:38:24   0:29:53   40d, 13:38:19

Table 1: Wall clock and total compute times for logistic regression on the 1.8TB Guide Star Catalog. “T-” denotes results for transpose reduction and “C-” denotes consensus. Times format is (days,) hours:mins:secs.

Note that transpose reduction methods need more startup time for some problems than consensus methods because the local Gram matrices $D_i^T D_i$ must be sent to the central node, aggregated, and the result inverted; this is not true for the lasso problem, for which consensus solvers must also invert a local Gram matrix on each node, though this at least saves startup communication costs. This startup time is particularly noticeable when overall solve time is short, as in Figure 2b. Note that even for this problem total computation time and wall time were still substantially shorter with transpose reduction than with consensus methods.


9.1  Effect  of  Heterogeneous  Data

When data is heterogeneous across nodes, the nodes have a stronger tendency to “disagree” on the solution, taking longer to reach a consensus. This effect is illustrated by a comparison between Figures 2a and 2b or Figures 1a and 1b, where consensus methods took much longer to converge on heterogeneous data sets. In contrast, because transpose reduction solves global sub-problems across the entire distributed data corpus, it is relatively insensitive to data heterogeneity across nodes. In the same figures, transpose reduction results were similar between the two scenarios, while consensus methods required much more time for the heterogeneous data. This explains the strong advantage of transpose reduction on the GSC-II dataset (Figure 2c, Table 1), which contains empirical data and is thus not uniformly distributed.

9.2  Communication  &  Computation

Transpose reduction leverages a tradeoff between communication and computation. When N nodes are used with a distributed data matrix $D \in \mathbb{R}^{m \times n}$, each consensus node transmits $x_i \in \mathbb{R}^n$ to the central server, which totals O(Nn) communication. Transpose reduction requires O(m) communication per iteration, which is often somewhat more. Despite this, transpose reduction is still highly efficient for two reasons. First, consensus requires inner iterations to solve expensive sub-problems, while transpose reduction does not. Second, transpose reduction methods stay synchronized better than consensus ADMM, making communication more efficient on synchronous architectures. The iterative methods used by consensus ADMM for logistic regression and SVM sub-problems do not terminate at the same time on every machine, especially when the data is heterogeneous across nodes. Consensus nodes must block until all nodes become synchronized. In contrast, Algorithm 1 requires the same computations on each server, allowing nodes to stay synchronized naturally.

10  Conclusion

We introduce transpose reduction ADMM, an iterative
method that solves model fitting problems using global
least-squares subproblems over a distributed dataset.
Theoretical convergence rates are superior for the new approach,
particularly when data is heterogeneous across nodes. This
is illustrated by numerical experiments using synthetic
and empirical data, both homogeneous and heterogeneous,
which demonstrate that transpose reduction can be
substantially more efficient than consensus methods.
Supplementary Material


A  SVM  Sub-steps  for  Consensus  Optimization 


A common formulation of the support vector machine (SVM) solves

$$\operatorname{minimize} \quad \tfrac{1}{2}\|x\|^2 + C\,h(Dx), \tag{16}$$

where $C$ is a regularization parameter, and $h$ is a simple “hinge loss” function given by $h(z) = \sum_{k=1}^{M} \max\{1 - l_k z_k,\, 0\}$.
The proximal mapping of $h$ has the form $\operatorname{prox}_h(z, \delta)_k = z_k + l_k \max\{\min\{1 - l_k z_k,\, \delta\},\, 0\}$. Using this proximal operator,
the solution to the $y$ update in (10) is simply $y^{k+1} = \operatorname{prox}_h\!\left(Dx^{k+1} + \lambda^k,\, C/\tau\right)$. Note that this algorithm is much simpler
than the consensus implementation of SVM, which requires each node to solve the sub-problem


$$\operatorname{minimize} \quad C\,h(Dx) + \tfrac{\tau}{2}\|x - y\|^2. \tag{17}$$


Despite the similarity of this problem to the original SVM (16), this problem form is not supported by available SVM
solvers such as LIBSVM [23]. However, techniques for the classical SVM problem can be easily adapted to
solve (17).
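
The closed-form proximal mapping of the hinge loss stated above can be evaluated elementwise. The following NumPy sketch is illustrative only: the function and argument names (`prox_hinge`, `labels`, `delta`) are hypothetical and are not taken from the authors' implementation, and the labels are assumed to satisfy $l_k \in \{-1, +1\}$.

```python
import numpy as np

def prox_hinge(z, labels, delta):
    """Elementwise proximal mapping of h(z) = sum_k max(1 - l_k z_k, 0).

    Implements prox_h(z, delta)_k = z_k + l_k * max(min(1 - l_k z_k, delta), 0).
    Assumes labels are +1 or -1 (an assumption of this sketch).
    """
    z = np.asarray(z, dtype=float)
    labels = np.asarray(labels, dtype=float)
    margin_gap = 1.0 - labels * z                       # how far each entry is from margin 1
    step = np.maximum(np.minimum(margin_gap, delta), 0.0)
    return z + labels * step

# Three regimes for a positive label: already beyond the margin (unchanged),
# close to the margin (moved exactly onto it), far from it (moved by delta).
z_demo = np.array([2.0, 0.9, -3.0])
labels_demo = np.array([1.0, 1.0, 1.0])
y_demo = prox_hinge(z_demo, labels_demo, delta=0.5)
print(y_demo)
```

Under these assumptions, the $y$ update would amount to a call such as `prox_hinge(D @ x + lam, labels, C / tau)`, where `lam` is a placeholder name for the scaled dual variable.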

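A common numerical approach to solving (16) is to attack its dual, which is

$$\operatorname*{minimize}_{\alpha_i \in [0, C]} \quad \tfrac{1}{2}\|A^T L \alpha\|^2 - \alpha^T \mathbf{1} = \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j l_i l_j A_i A_j^T - \sum_i \alpha_i. \tag{18}$$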


Once (18) is solved to obtain $\alpha^*$, the solution to (13) is simply given by $w^* = A^T L \alpha^*$. The dual formulation (18) is
advantageous because the constraints on $\alpha$ act separately on each coordinate. The dual is therefore solved efficiently by
coordinate descent, which is the approach used by the popular solver LIBSVM [23]. This method is particularly powerful
when the number of support vectors in the solution is small, in which case most of the entries in $\alpha$ assume the value 0 or $C$.
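
To make the role of the separable box constraints concrete, here is a minimal sketch of cyclic coordinate descent on the dual (18). It is a generic textbook scheme, not LIBSVM's algorithm and not the authors' code. The name `svm_dual_cd` and its parameters are hypothetical, the rows of `A` are assumed to be the data points, and $L$ is assumed to be the diagonal matrix of labels $l_i \in \{-1, +1\}$.

```python
import numpy as np

def svm_dual_cd(A, labels, C, n_passes=50):
    """Cyclic coordinate descent for
        minimize_{0 <= alpha_i <= C}  0.5 * ||A^T L alpha||^2 - alpha^T 1,
    where L = diag(labels) (assumption of this sketch)."""
    A = np.asarray(A, dtype=float)
    labels = np.asarray(labels, dtype=float)
    m, n = A.shape
    alpha = np.zeros(m)
    w = np.zeros(n)                                  # invariant: w = A^T L alpha
    row_norms = np.einsum("ij,ij->i", A, A)          # squared norm of each row
    for _ in range(n_passes):
        for i in range(m):
            if row_norms[i] == 0.0:
                continue                             # zero rows never change w
            grad_i = labels[i] * (A[i] @ w) - 1.0    # partial derivative w.r.t. alpha_i
            # Exact minimization along coordinate i, then projection onto [0, C].
            new_alpha = min(max(alpha[i] - grad_i / row_norms[i], 0.0), C)
            w += (new_alpha - alpha[i]) * labels[i] * A[i]
            alpha[i] = new_alpha
    return alpha, w

# Tiny separable example: the max-margin solution is w = (0.5, 0.5).
A_demo = np.array([[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0], [0.0, -2.0]])
labels_demo = np.array([1.0, 1.0, -1.0, -1.0])
alpha_demo, w_demo = svm_dual_cd(A_demo, labels_demo, C=10.0)
print(alpha_demo, w_demo)
```

Because each constraint touches a single coordinate, the projection step is just a clamp to $[0, C]$, and maintaining `w` incrementally keeps the cost of one coordinate update proportional to the number of features.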

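In the context of consensus ADMM, we must solve

$$\operatorname{minimize} \quad \tfrac{1}{2}\|w\|^2 + C\,h(Aw, l) + \tfrac{\tau}{2}\|w - z\|^2. \tag{19}$$

Following the classical SVM literature, we dualize this problem to obtain

$$\operatorname*{minimize}_{\alpha_i \in [0, C]} \quad \tfrac{1}{2}\|A^T L \alpha\|^2 - \alpha^T\big((1 + \tau)\mathbf{1} - \tau L A z\big). \tag{20}$$

We then solve (20) for $\alpha^*$, and recover the solution via

$$w^* = \frac{A^T L \alpha^* + \tau z}{1 + \tau}.$$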

We solve (20) using a dual coordinate descent method inspired by [23]. The implementation has $O(M)$ complexity per
iteration. Also following [23], we accelerate convergence by updating coordinates with the largest residual (derivative)
on each pass.
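
The consensus sub-problem and its recovery formula can be sketched in the same way. The code below is a simplified illustration under stated assumptions, not the authors' solver: it sweeps the coordinates cyclically rather than prioritizing those with the largest residual, the name `consensus_svm_subproblem` and the optional warm-start argument `alpha0` are hypothetical, and $L$ is again assumed to be the diagonal matrix of $\pm 1$ labels.

```python
import numpy as np

def consensus_svm_subproblem(A, labels, z, C, tau, alpha0=None, n_passes=100):
    """Approximately solve
        minimize_w 0.5*||w||^2 + C*hinge(A w, labels) + 0.5*tau*||w - z||^2
    via cyclic coordinate descent on its box-constrained dual."""
    A = np.asarray(A, dtype=float)
    labels = np.asarray(labels, dtype=float)
    z = np.asarray(z, dtype=float)
    m = A.shape[0]
    if alpha0 is None:
        alpha = np.zeros(m)
    else:
        alpha = np.clip(np.asarray(alpha0, dtype=float), 0.0, C)   # warm start
    u = A.T @ (labels * alpha)                     # invariant: u = A^T L alpha
    row_norms = np.einsum("ij,ij->i", A, A)
    for _ in range(n_passes):
        for i in range(m):
            if row_norms[i] == 0.0:
                continue
            # Partial derivative of the dual objective with respect to alpha_i.
            grad_i = labels[i] * (A[i] @ (u + tau * z)) - (1.0 + tau)
            new_alpha = min(max(alpha[i] - grad_i / row_norms[i], 0.0), C)
            u += (new_alpha - alpha[i]) * labels[i] * A[i]
            alpha[i] = new_alpha
    w = (u + tau * z) / (1.0 + tau)                # recover the primal solution
    return w, alpha

# One-dimensional check: minimizing w^2 + max(0, 1 - w) gives w = 0.5.
w_demo, alpha_demo = consensus_svm_subproblem([[1.0]], [1.0], [0.0], C=1.0, tau=1.0)
print(w_demo, alpha_demo)
```

Passing the previous iteration's `alpha` as `alpha0` is one way to realize a warm start across consensus iterations.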

Because our solver does not need to handle a “bias” variable (in consensus optimization, only the central server treats the
bias variables differently from other unknowns), and because it uses a warm start to accelerate solve time across iterations, our
coordinate descent method significantly outperforms even LIBSVM on each sub-problem. On a desktop computer with a
Core i5 processor, LIBSVM solves the synthetic data test problem with m = 100 datapoints and n = 200 features in 3.4
seconds (excluding “setup” time), whereas our custom solver solves each SVM sub-problem for the consensus
SVM with the same dimensions (on a single processor) in 0.17 seconds (averaged over all iterations). When m = 10000
and n = 20, LIBSVM requires over 20 seconds, while the average solve time for the custom solver embedded in the
consensus method is only 2.3 seconds.
B Tables of Results

In  the  following  tables,  we  use  these  labels: 

•  N:  Number  of  data  points  per  core 

•  F:  Number  of  features  per  data  point 

•  Cores:  Number  of  compute  cores  used  in  computation 

•  Space:  Total  size  of  data  corpus  in  GB  (truncated  at  GB) 

•  TWalltime:  Walltime  for  transpose  method  (truncated  at  seconds) 

•  TCompute:  Total  computation  time  for  transpose  method  (truncated  at  seconds) 

•  CWalltime:  Walltime  for  consensus  method  (truncated  at  seconds) 

•  CCompute:  Total  computation  time  for  consensus  method  (truncated  at  seconds) 
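
For readers who want to compare entries numerically, the following sketch converts the time format used in the tables into seconds and recomputes the Space column. The helper names `parse_duration` and `corpus_size_gb` are hypothetical. The choice of 8 bytes per value and 1 GB = 2^30 bytes is an assumption of this sketch, not a statement from the paper, although it reproduces the Space entries checked here.

```python
import re

def parse_duration(text):
    """Convert 'H:MM:SS' or 'D day(s) H:MM:SS' into a number of seconds."""
    match = re.fullmatch(r"\s*(?:(\d+)\s+days?\s+)?(\d+):(\d{2}):(\d{2})\s*", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    days = int(match.group(1) or 0)
    hours, minutes, seconds = (int(match.group(i)) for i in (2, 3, 4))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds

def corpus_size_gb(points_per_core, features, cores, bytes_per_value=8):
    """Total corpus size in GB, truncated. Assumes dense storage with
    `bytes_per_value` bytes per entry and 1 GB = 2**30 bytes."""
    total_bytes = points_per_core * features * cores * bytes_per_value
    return total_bytes // 2**30

# Ratio of total compute time (consensus / transpose) for one table row.
t_compute = parse_duration("6:19:14")
c_compute = parse_duration("17:25:18")
print(c_compute / t_compute)
print(corpus_size_gb(50000, 2000, 800))
```

For the row used in the example (N = 50000, F = 2000, 800 cores, homogeneous logistic regression), the compute-time ratio comes out to about 2.76.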
Logistic regression with homogeneous data

N       F     Cores  Space(GB)  TWalltime  TCompute         CWalltime  CCompute
50000   2000  800    596        0:00:53    6:19:14          0:01:36    17:25:18
50000   2000  1600   1192       0:00:58    12:40:24         0:01:51    1 day 10:51:33
50000   2000  2400   1788       0:01:00    19:05:13         0:01:52    2 days 4:21:25
50000   2000  3200   2384       0:01:00    1 day 1:30:18    0:01:41    2 days 21:46:28
50000   2000  4000   2980       0:00:58    1 day 7:58:24    0:01:39    3 days 15:17:51
50000   2000  4800   3576       0:00:58    1 day 14:27:31   0:02:31    4 days 8:49:58
50000   2000  5600   4172       0:01:00    1 day 21:10:38   0:02:13    5 days 2:16:56
50000   2000  6400   4768       0:01:03    2 days 3:46:42   0:02:08    5 days 19:39:40
50000   2000  7200   5364       0:01:21    2 days 10:36:36  0:01:47    6 days 13:12:59
100000  1000  2000   1490       0:02:09    11:50:56         0:01:58    2 days 0:28:08
100000  1000  4000   2980       0:01:32    1 day 0:05:30    0:04:14    4 days 0:58:47
100000  1000  6000   4470       0:01:40    1 day 12:20:57   0:02:00    6 days 1:36:20
100000  1000  8000   5960       0:00:42    2 days 0:42:49   0:03:33    8 days 1:59:14
100000  1000  10000  7450       0:01:01    3 days 5:30:41   0:02:43    10 days 2:30:10
100000  1000  12000  8940       0:01:16    4 days 0:50:36   0:02:54    12 days 2:59:08
100000  1000  14000  10430      0:01:33    4 days 16:42:20  0:05:00    14 days 3:36:58
100000  1000  16000  11920      0:01:18    5 days 3:40:44   0:03:19    16 days 4:11:34
100000  1000  18000  13411      0:01:07    5 days 6:45:44   0:05:29    18 days 4:56:02
100000  1000  20000  14901      0:01:17    6 days 21:44:52  0:03:14    20 days 5:36:16
5000    2000  4800   357        0:00:33    4:04:11          0:00:26    21:01:22
10000   2000  4800   715        0:00:26    7:51:06          0:01:22    1 day 21:24:47
15000   2000  4800   1072       0:00:38    11:23:22         0:01:37    2 days 19:42:30
20000   2000  4800   1430       0:00:42    15:15:01         0:01:30    3 days 19:27:24
25000   2000  4800   1788       0:00:42    18:59:04         0:01:48    4 days 17:24:59
30000   2000  4800   2145       0:00:47    22:53:25         0:02:04    5 days 16:30:28
35000   2000  4800   2503       0:00:57    1 day 2:43:48    0:02:46    6 days 15:10:40
40000   2000  4800   2861       0:00:54    1 day 6:22:51    0:02:42    7 days 14:58:02
45000   2000  4800   3218       0:00:57    1 day 10:05:17   0:03:02    8 days 15:11:42
50000   2000  4800   3576       0:01:02    1 day 14:28:30   0:03:24    9 days 15:51:21
20000   500   4800   357        0:00:05    2:18:21          0:00:35    20:51:20
20000   1000  4800   715        0:00:12    5:33:31          0:01:40    1 day 18:43:21
20000   1500  4800   1072       0:00:25    9:44:07          0:01:08    2 days 20:08:20
20000   2000  4800   1430       0:00:31    15:10:01         0:01:29    3 days 19:28:56
20000   2500  4800   1788       0:01:23    1 day 12:24:25   0:03:30    4 days 20:53:53
20000   3000  4800   2145       0:01:50    1 day 20:29:59   0:03:44    5 days 19:45:31
20000   3500  4800   2503       0:02:27    2 days 5:40:09   0:03:56    6 days 19:44:54
20000   4000  4800   2861       0:03:03    2 days 16:50:51  0:03:46    7 days 18:17:21
20000   4500  4800   3218       0:04:00    3 days 3:35:02   0:04:28    8 days 19:49:26
20000   5000  4800   3576       0:04:52    3 days 16:50:21  0:04:44    9 days 23:56:16

Logistic regression with heterogeneous data

N       F     Cores  Space(GB)  TWalltime  TCompute         CWalltime  CCompute
50000   2000  800    596        0:00:56    6:14:57          0:09:25    3 days 19:56:38
50000   2000  1600   1192       0:01:01    12:28:12         0:09:35    7 days 19:00:17
50000   2000  2400   1788       0:00:58    18:43:11         0:09:35    11 days 13:26:10
50000   2000  3200   2384       0:00:58    1 day 1:09:09    0:09:39    15 days 10:33:19
50000   2000  4000   2980       0:01:23    1 day 7:34:22    0:09:49    19 days 6:45:31
50000   2000  4800   3576       0:01:11    1 day 13:51:15   0:34:30    77 days 5:23:50
50000   2000  5600   4172       0:01:29    1 day 20:20:50   0:34:38    90 days 19:10:12
50000   2000  6400   4768       0:01:01    2 days 2:55:20   0:35:31    103 days 19:09:22
50000   2000  7200   5364       0:01:14    2 days 9:38:02   0:10:26    34 days 20:11:28
100000  1000  2000   1490       0:01:31    11:15:47         0:26:49    23 days 21:30:59
100000  1000  4000   2980       0:01:03    22:44:45         0:25:23    48 days 17:23:23
100000  1000  6000   4470       0:00:42    1 day 10:38:14   0:24:38    73 days 15:10:07
100000  1000  8000   5960       0:00:43    1 day 22:25:35   0:25:08    98 days 12:53:22
100000  1000  10000  7450       0:00:56    2 days 10:13:27  0:25:39    123 days 0:26:26
100000  1000  12000  8940       0:01:24    2 days 22:10:47  0:25:00    146 days 22:00:35
100000  1000  14000  10430      0:01:16    4 days 11:33:53  0:26:27    171 days 8:40:10
100000  1000  16000  11920      0:00:56    3 days 22:59:09  0:25:18    195 days 19:54:41
100000  1000  18000  13411      0:01:26    4 days 11:34:10  0:26:03    218 days 19:17:19
100000  1000  20000  14901      0:01:59    4 days 23:59:15  0:26:27    243 days 4:55:47

Lasso with heterogeneous data

N       F     Cores  Space(GB)  TWalltime  TCompute         CWalltime  CCompute
50000   200   800    59         0:00:12    0:01:45          0:00:37    0:04:55
50000   200   1600   119        0:00:02    0:03:31          0:00:47    0:10:56
50000   200   2400   178        0:00:02    0:05:14          0:01:14    0:17:50
50000   200   3200   238        0:00:00    0:07:00          0:01:22    0:25:24
50000   200   4000   298        0:00:04    0:09:00          0:01:36    0:33:49
50000   200   4800   357        0:00:11    0:10:25          0:01:57    0:43:29
50000   200   5600   417        0:00:10    0:12:09          0:02:07    0:55:47
50000   200   6400   476        0:00:07    0:13:48          0:02:19    1:04:51
50000   200   7200   536        0:00:09    0:15:31          0:02:39    1:19:22
50000   1000  800    298        0:00:04    0:33:28          0:05:20    2:58:02
50000   1000  1600   596        0:00:18    1:06:33          0:06:23    6:00:37
50000   1000  2400   894        0:00:25    1:39:50          0:08:28    9:04:25
50000   1000  3200   1192       0:00:09    2:12:14          0:08:34    12:07:04
50000   1000  4000   1490       0:00:08    2:46:27          0:09:52    15:13:18
50000   1000  4800   1788       0:00:21    3:24:38          0:13:28    18:11:34
50000   1000  5600   2086       0:00:10    3:50:29          0:14:55    21:25:49
50000   1000  6400   2384       0:00:06    4:26:31          0:16:11    1 day 0:27:56
50000   1000  7200   2682       0:00:11    4:56:57          0:17:11    1 day 3:34:19

SVM with homogeneous data

N       F     Cores  Space(GB)  TWalltime  TCompute         CWalltime  CCompute
50000   20    48     0          0:00:01    0:00:46          0:02:45    2:01:12
50000   20    96     0          0:00:01    0:01:32          0:02:47    4:03:05
50000   20    144    1          0:00:02    0:02:19          0:02:49    5:58:08
50000   20    192    1          0:00:02    0:03:06          0:02:45    7:56:14
50000   20    240    1          0:00:02    0:03:53          0:02:51    9:54:05
50000   50    48     0          0:00:03    0:01:23          0:05:14    3:44:06
50000   50    96     1          0:00:03    0:02:47          0:05:19    7:26:30
50000   50    144    2          0:00:03    0:04:11          0:05:25    11:07:51
50000   50    192    3          0:00:07    0:05:38          0:05:25    14:54:03
50000   50    240    4          0:00:03    0:07:00          0:05:25    18:26:10
50000   100   48     1          0:00:05    0:02:20          0:09:28    6:25:55
50000   100   96     3          0:00:05    0:04:40          0:09:56    12:49:20
50000   100   144    5          0:00:05    0:07:04          0:09:45    19:09:22
50000   100   192    7          0:00:06    0:09:25          0:09:53    1 day 1:27:47
50000   100   240    8          0:00:05    0:11:46          0:10:06    1 day 8:08:18

Star data

Cores  TWalltime  TCompute  CWalltime  CCompute
2500   0:01:06    11:35:25  0:24:39    31 days 19:59:13
3000   0:00:49    12:10:33  0:21:43    32 days 2:44:11
3500   0:00:50    12:17:27  0:17:01    30 days 7:56:19
4000   0:00:45    12:38:24  0:29:53    40 days 13:38:19

